Data Centers
As internet adoption grew, computing shifted from local machines to centralized data centers, large facilities housing thousands of servers and providing computational power and storage for various applications.
Data centers are geographically distributed in areas with favorable cooling and power conditions, reducing user latency and improving fault tolerance.
Benefits of Centralized Computing
User Benefits:
- Ease of management: The user doesn’t need to worry about hardware maintenance, backups, software updates, or security patches.
- Ubiquity: Users can access their applications and data from any device with an internet connection, enabling remote work and global collaboration.
- Compute power: Some applications require more computational resources than a typical personal computer can provide. Cloud computing allows users to access powerful servers that can handle intensive workloads, such as data analysis, machine learning, and video rendering.
Vendor Benefits:
- Homogeneity: The vendor can optimize their code for a specific hardware configuration, rather than having to support the wide variety of user hardware.
- Change management: The vendor can update the code and deploy it to all users at once, without relying on users to update their local software.
Infrastructure Benefits:
- Scalability: Servers can be added or removed based on demand, enabling efficient resource allocation.
- Cost-effectiveness: Providers achieve economies of scale, reducing per-user costs.
- Multi-tenancy: Multiple customers share the same data center, maximizing server utilization rather than leaving machines idle.
Warehouse-Scale Computing
Warehouse-scale computing (WSC) is a type of data center architecture that treats thousands of interconnected servers as a single unified system.
This enables running large-scale applications (search engines, social media platforms, online gaming services) that require significant computational resources to be efficiently managed and scaled.
Many such providers also offer cloud services, virtualizing their infrastructure for external customers, allowing a traditional data center to be built on top of a warehouse-scale computing infrastructure.
Geographic Distribution
Global cloud infrastructure is hierarchically organized for redundancy and low latency:
- Geographical Area (GA): Determines data residency requirements.
- Computing Regions: At least two per GA, separated by ≥100 miles to avoid common failures (earthquakes, natural disasters). Allows disaster recovery but too distant for synchronous replication.
- Availability Zones: Multiple zones (min. 3) within a region, isolated yet close enough for synchronous replication. Provides a finer-grained redundancy level for critical applications enabling faster recovery from failures.
- Edge Locations: Smaller data centers closer to users. Used for CDNs and caching to reduce latency and improve content delivery speed.
- Local Zones: Metropolitan-area data centers providing ultra-low latency for location-specific applications.
Physical Architecture
Data center architecture is similar to that of personal computers but at a massive scale.
Computing Components:
- Servers: Standardized physical machines providing computational power and storage. Functionally equivalent to regular computers.
- Networking equipment: Switches, routers, and firewalls connecting servers and providing internet access.
- Storage systems: Additional storage capacity for applications and data.
Support Infrastructure:
- Power supply: Continuous, reliable power delivery. Redundancy includes backup generators and uninterruptible power supplies (UPS) to handle outages.
- Cooling systems: Manage server heat through air conditioning, liquid cooling, or hybrid approaches to maintain optimal operating temperatures.
- Failure recovery: Ensures system availability via batteries, diesel generators, and other redundancy mechanisms.
Server
Servers are fundamental computing units in data centers, designed for performance, reliability, and scalability.
Form Factors
Servers are made in different standard form factors such as:
- Rack-mounted (most common): Standardized units (1U = 44.45 mm height) fitting into vertical racks. Racks integrate power distribution, cooling, networking, and cable management, enabling high density and efficient operations. Excellent space efficiency but complex cable management at scale.
- Blade servers: Vertically oriented, ultra-compact form factor. Highest component density per unit of space but requires specialized cooling due to high power density. These are more expensive than rack-mounted.
- Tower servers: Standalone units resembling desktop computers. Low density with simple cooling/maintenance, and lower cost. Rarely used in modern data centers due to poor scalability and space inefficiency.
Components
- Motherboard: Central circuit board interconnecting all components.
- CPUs: 1 to 8 processors per server.
- RAM: 2 to 192 DIMM slots for main memory.
- Storage: Multiple hard drives or SSDs for persistent data.
- Specialized Hardware (optional):
- GPUs: Accelerate parallel compute tasks (machine learning, scientific computing). Communicate via NVLink (high-speed interconnect) to minimize latency bottlenecks.
- TPUs: Tensor Processing Units specialized for neural network training/inference thanks to optimized matrix operations.
- FPGAs: Field-Programmable Gate Arrays. Customizable hardware programmed for specific low-latency, application-specific acceleration (real-time processing, network processing).
All components are standardized for quick replacement and maintenance, with hot-swappable parts to minimize downtime.
Thermal Management
Data centers use a cold aisle/warm aisle configuration to maximize air cooling efficiency:
- Cold aisle: Center floor supplies cold air; flows through server intake ports.
- Warm aisle: Rear of servers; hot exhaust air expelled upward.
- Containment: Roof caps on racks prevent cold air bypass, forcing air through servers and maximizing cooling efficiency.
Storage
Over time, data has moved from local storage toward cloud providers. This is due to:
- Ease of management, with automatic backups and data recovery;
- Low price;
- Ease of access everywhere there is an internet connection.
File System Abstractions
OS manages data through hierarchical abstractions:
- Data Blocks: Smallest units of storage, addressable with logical block addresses (LBA).
- Clusters: Groups of contiguous blocks, used to reduce overhead compared to block-level management by reducing the number of read/write operations. Each cluster contains the actual data plus metadata, the file attributes (name, size, permissions, timestamps) that enable organization and access control.
During data deletion, the cluster is only flagged as deleted, allowing it to be overwritten.
Space Allocation
A file occupies disk space in multiples of the cluster size:
\text{Disk Size} = \lceil \frac{\text{File Size}}{\text{Cluster Size}} \rceil \times \text{Cluster Size}
When the file size is not an exact multiple of the cluster size, part of the last cluster is wasted, leading to internal fragmentation:
\text{Wasted Space} = \text{Disk Size} - \text{File Size}
When a file's clusters are non-contiguous (fragmented), read/write operations require multiple seeks, degrading performance. In these cases it is useful to perform defragmentation, which rearranges the file's clusters into contiguous, sequential blocks.
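As a quick illustration of these two formulas, here is a minimal Python sketch; the 4 KB cluster size and the 10 000-byte file are arbitrary example values.

```python
import math

def allocation(file_size: int, cluster_size: int = 4096) -> tuple[int, int]:
    """Return (disk size, wasted space) in bytes for a file stored in whole clusters."""
    disk_size = math.ceil(file_size / cluster_size) * cluster_size
    wasted = disk_size - file_size
    return disk_size, wasted

# A 10 000-byte file on 4 KB clusters occupies 3 clusters (12 288 bytes),
# wasting 2 288 bytes to internal fragmentation.
print(allocation(10_000))   # (12288, 2288)
```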
Hard Disk Drives
Physical Structure:
HDDs contain rotating magnetic platters coated with ferromagnetic material. Data is stored as magnetic patterns organized into:
- Tracks: Concentric circles on each platter.
- Sectors: Divisions of tracks, the smallest atomic read/write unit.
The platters are mounted on a spindle and spin at high speeds (RPM). An actuator arm with a read/write head moves across the platters to access data.
The entire assembly is enclosed in a sealed case to protect against dust, scratches, and environmental contaminants, while also providing shock resistance.
Access Time Components
During the read/write process, several time components contribute to the total access time:
- Seek Time: Time for actuator arm to position head over target track. Heuristic: T_\text{Seek} \approx \frac{T_\text{max}}{3}
- Rotation Delay: Average time for target sector to rotate under head: T_\text{Rotation} = \frac{1}{2} \times \frac{60}{\text{RPM}}
- Transfer Time: Duration to read/write data at disk transfer rate, based on the amount of data and the disk’s throughput.
- Controller Overhead: Command processing and disk preparation time.
The total access time is the sum of these components: T_\text{Access} = T_\text{Seek} + T_\text{Rotation} + T_\text{Transfer} + T_\text{Controller}
Data locality, the tendency for related data to be stored close together, can significantly reduce access time by minimizing seek and rotation delays. The locality factor is denoted \alpha. The adjusted access time considering locality is:
T_\text{Access} = (1-\alpha)(T_\text{Seek} + T_\text{Rotation}) + T_\text{Transfer} + T_\text{Controller}
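The following minimal Python sketch puts the access-time formulas together; the drive parameters (9 ms seek, 7200 RPM, 150 MB/s, 0.2 ms controller overhead) are illustrative values, not a specific disk's datasheet.

```python
def hdd_access_time_ms(seek_ms, rpm, request_kb, transfer_mb_s, controller_ms, alpha=0.0):
    """Total access time in milliseconds; alpha is the locality factor that
    discounts the seek and rotation components."""
    rotation_ms = 0.5 * 60_000 / rpm                        # half a revolution on average
    transfer_ms = (request_kb / 1024) / transfer_mb_s * 1000
    return (1 - alpha) * (seek_ms + rotation_ms) + transfer_ms + controller_ms

# 64 KB request with and without locality
print(hdd_access_time_ms(9, 7200, 64, 150, 0.2))             # ~13.8 ms
print(hdd_access_time_ms(9, 7200, 64, 150, 0.2, alpha=0.3))  # ~9.8 ms
```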
To reduce access time, HDDs include buffer memory that exploits spatial locality by storing neighboring sectors.
Writes target cache first, then flush to platters. This reduces repeated disk access for frequently accessed data.
Scheduling
When multiple I/O requests are pending, the disk scheduler determines the order in which they are processed. The goal is to minimize total access time and maximize throughput. This introduces a scheduling delay, as the disk may need to finish the current request before processing the next one. Common scheduling algorithms include (see the sketch after this list):
- FCFS (First-Come, First-Served): Requests are processed in the order they arrive.
- SSTF (Shortest Seek Time First): The request with the shortest seek time from the current head position is processed next; this may lead to starvation of distant requests.
- SCAN: The read/write head moves in one direction, processing requests as it goes, and then reverses direction.
- C-SCAN: The read/write head moves in one direction, processing requests as it goes, and then jumps back to the beginning without reversing direction.
- C-LOOK: Similar to C-SCAN, but the head only travels as far as the last pending request in each direction instead of reaching the extreme end of the disk before jumping back.
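As an illustration, here is a minimal Python sketch of SSTF ordering (a greedy nearest-track policy); the head position and request queue are arbitrary example values.

```python
def sstf_order(head: int, requests: list[int]) -> list[int]:
    """Greedily service the pending request on the track closest to the head."""
    pending, order = list(requests), []
    while pending:
        nearest = min(pending, key=lambda track: abs(track - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order

print(sstf_order(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# [65, 67, 37, 14, 98, 122, 124, 183]
```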
Solid State Drives
SSDs use flash memory (no mechanical parts), managed by a silicon controller, and use the same form factors as HDDs.
At the beginning of its life, an SSD is faster than an HDD because it has no seek time or rotation delay. However, as the SSD fills up and undergoes more write cycles, its performance can degrade, mainly for writes.
Data is organized into:
- Cell: Floating-gate transistor storing charge (presence/absence = bit). An oxide layer insulates the gate, retaining the charge without power.
- Cell Density: SLC (1 bit), MLC (2 bits), TLC (3 bits).
- Page: Smallest readable/writable unit.
- Block: Smallest erasable unit (contains multiple pages).
Each cell has a limited number of write cycles, leading to wear-out as the oxide layer degrades.
Each page can be in one of three states:
- Valid: Contains readable data.
- Dirty: Contains obsolete data, eligible for erasure.
- Empty: Block erased, ready for writing.
Writes always target empty pages; updates write to new pages and mark old pages dirty. Blocks containing only dirty pages can be erased. This prevents repeated wear on single cells but introduces challenges:
- Write Amplification: Updating a page requires copying valid data to a new block and erasing the old block, so the physical writes exceed the original logical update; this can degrade performance and reduce SSD lifespan.
- Garbage Collection: Identifies dirty-heavy blocks, copies valid data to new blocks, erases old blocks, and marks them empty.
To mitigate wear-out, SSDs implement a technique called Wear Leveling, which distributes write/erase cycles evenly across cells. Data is periodically relocated to ensure \text{max cycles} - \text{min cycles} < e (a small threshold).
SSDs use a Flash Translation Layer (FTL), a firmware component that manages the mapping between the logical block addresses (LBAs) used by the operating system and the physical addresses of the flash memory (a sketch follows the list below). Mapping strategies include:
- Page-level: Fine-grained but expensive.
- Block-level: Coarser, faster.
- Hybrid: Balances cost and performance.
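A minimal sketch of page-level mapping under a toy drive model; the class name, the bookkeeping structures, and the in-memory `flash` dictionary are illustrative assumptions, not a real controller's implementation.

```python
class PageLevelFTL:
    """Toy page-level FTL: each logical page maps to one physical page.
    Updates always go to an empty page; the old page is marked dirty so a
    later garbage-collection pass can erase its block."""

    def __init__(self, num_physical_pages: int):
        self.mapping = {}                              # logical page -> physical page
        self.empty = list(range(num_physical_pages))   # empty physical pages
        self.dirty = set()                             # obsolete (dirty) physical pages
        self.flash = {}                                # physical page -> data

    def write(self, logical_page: int, data: bytes) -> None:
        physical = self.empty.pop(0)                   # writes always target empty pages
        if logical_page in self.mapping:               # update: old copy becomes dirty
            self.dirty.add(self.mapping[logical_page])
        self.mapping[logical_page] = physical
        self.flash[physical] = data

    def read(self, logical_page: int) -> bytes:
        return self.flash[self.mapping[logical_page]]

ftl = PageLevelFTL(num_physical_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")                 # in-place update from the OS's point of view
print(ftl.read(0), ftl.dirty)       # b'v2' {0}
```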
Reliability
The Unrecoverable Bit Error Ratio (UBER) behaves differently on the two technologies: HDD UBER increases linearly with age, while SSD UBER starts low, increases as the drive wears out, and then rises sharply near the end of its lifespan.
Storage Architectures
Direct Attached Storage (DAS)
Direct Attached Storage is physically connected to a single server (internal or external via SATA, USB).
- Pros: High performance, no network latency.
- Cons: Limited scalability, no sharing, complex backup/recovery.
Network Attached Storage (NAS)
Network Attached Storage is storage that is connected to a network and has its own IP address, appearing as a file server. It provides file-level access to data over the network, allowing multiple clients to access and share files simultaneously.
- Pros: Easy to scale, centralized backup.
- Cons: Network bandwidth bottleneck, higher latency than DAS.
Storage Area Network (SAN)
Storage Area Network is a network that provides block-level access to data, allowing servers to access storage as if it were directly attached.
Dependability
Systems fail due to: defects, degradation, radiation, design errors, bugs, attacks, and human errors. This leads to economic losses, information loss, physical harm, and reputation damage.
Dependability is a measure of trust toward a system. It comprises five key attributes:
- Reliability: Ability of a system to perform its intended functions under specified conditions for a defined period of time.
- Availability: The degree to which a system is operational and accessible when required for use. Formula: A = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}
- Maintainability: The ease with which a system can be repaired, modified, and restored to working condition.
- Safety: Absence of catastrophic consequences to users or the environment.
- Security: Protection of a system from unauthorized access and interference, maintaining confidentiality, integrity, non-repudiation, and survivability.
Fault-Error-Failure Chain
A fault is a defect or anomaly in a system.
When a fault is activated, it becomes an error, a deviation from correct operation.
If an error is not detected and corrected, it propagates and ultimately causes a failure, meaning that the system ceases to perform its intended function.
Dependability Approaches
Two primary techniques address dependability:
- Fault Avoidance: Preventing faults from occurring through rigorous testing, validation, formal verification, and use of fault-tolerant components.
- Fault Tolerance: Building systems that continue operating correctly despite faults through error detection, monitoring, self-recovery mechanisms, redundancy, and graceful degradation.
This is a tradeoff between cost (hardware and development), performance, and dependability. Design decisions depend on technologies, requirements, context, and environment.
Reliability Metrics
Reliability follows an exponential failure model: R(t) = e^{-\lambda t}
where:
- t is the time period of interest
- \lambda (lambda) is the constant failure rate (failures per unit time)
- R(t) is the probability that the system operates without failure during time t
Mean Time To Failure (MTTF): expected time until first failure: \text{MTTF} = \int_0^\infty R(t) \, dt = \frac{1}{\lambda}
Mean Time To Repair (MTTR): expected time to detect, repair, and recover: \text{MTTR} = t_{\text{detect}} + t_{\text{repair}} + t_{\text{recover}}
Mean Time Between Failures (MTBF): expected time between consecutive failures in repairable systems: \text{MTBF} = \text{MTTF} + \text{MTTR}
Availability formula: A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}
Failures In Time (FIT): number of failures per billion device-hours: \text{FIT} = \frac{10^9}{\text{MTBF}}
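A minimal worked example tying these metrics together; the failure rate and repair time below are arbitrary example values.

```python
import math

lam = 1e-4     # failure rate: 1 failure per 10,000 hours
mttr = 8       # hours to detect, repair, and recover

mttf = 1 / lam                           # 10,000 hours
mtbf = mttf + mttr                       # 10,008 hours
availability = mttf / (mttf + mttr)      # ~0.9992
fit = 1e9 / mtbf                         # failures per billion device-hours
r_one_year = math.exp(-lam * 24 * 365)   # probability of surviving one year

print(f"MTTF={mttf:.0f} h  A={availability:.4f}  FIT={fit:.0f}  R(1 year)={r_one_year:.3f}")
```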
Component Lifecycle
A component experiences three phases during its operational lifetime:
- Infant Mortality: Early phase with high failure rates; failures occur due to manufacturing defects and design issues.
- Useful Life: This is the primary operating window where the failure rate is relatively low and stable.
- Worn-out: Late phase with increasing failure rate where the component deteriorates due to age and use. Maintenance and eventual replacement become necessary.
System updates and new deployments risk introducing failures into production. Some common strategies mitigate this risk:
- Staged Rollout: Deploy changes gradually to an increasing fraction of users or systems, detecting issues early before full deployment.
- Canary Deployment: Deploy to a small, representative subset (canaries) to validate behavior in production before rolling out to all systems.
- Automatic Rollback: Monitor deployed changes and automatically revert to a previous stable version if failures or anomalies are detected.
Reliability Block Diagram
The system structure is represented as a block diagram where each component is a block and links show dependencies.
A system functions if there exists at least one operational path from start to end.
Connections represent two reliability configurations:
Series Configuration (both components required): The system fails if any single component fails. Overall reliability decreases with each additional series component. R_s(t) = \prod_{i=1}^{n} R_i(t) = e^{-t\sum_{i=1}^{n}\lambda_i}
Parallel Configuration (at least one component required): The system continues if any single component survives. Overall reliability increases with redundancy. R_p(t) = 1 - \prod_{i=1}^{n}(1 - R_i(t))
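A minimal sketch of the two configurations, assuming independent components with given reliabilities:

```python
from math import prod

def series_reliability(reliabilities):
    """System works only if every component works."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """System works if at least one component works."""
    return 1 - prod(1 - r for r in reliabilities)

# Three independent components with R = 0.95 each
print(series_reliability([0.95] * 3))    # 0.857375
print(parallel_reliability([0.95] * 3))  # 0.999875
```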
Standby Redundancy: A redundant component remains idle until the primary component fails, then automatically activates. This approach approximately doubles the MTTF compared to a single component.
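Assuming an ideal (instantaneous, fault-free) switch and two identical components with failure rate \lambda, the spare only starts aging when the primary fails, so the expected lifetimes add: \text{MTTF}_\text{standby} = \frac{1}{\lambda} + \frac{1}{\lambda} = \frac{2}{\lambda}, roughly double the single-component MTTF.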
r-out-of-n Redundancy
A system that requires r out of n components to function correctly for the system to operate.
The system reliability for r-out-of-n redundancy (assuming identical components, each with reliability R):
R_{\text{voting}} = \sum_{i=r}^{n} \binom{n}{i} R^i(1-R)^{n-i}
This formula sums the probability that at least r components are operational.
When a majority of components must be operational, a single component can be more reliable than the entire redundant system, especially when the failure rate \lambda is high (i.e., when individual component reliability is low).
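A minimal sketch evaluating this formula, here for 2-out-of-3 voting (triple modular redundancy) with illustrative component reliabilities:

```python
from math import comb

def r_out_of_n(r: int, n: int, reliability: float) -> float:
    """Probability that at least r of n identical, independent components work."""
    return sum(
        comb(n, i) * reliability**i * (1 - reliability)**(n - i)
        for i in range(r, n + 1)
    )

# Better than a single component when R is high...
print(r_out_of_n(2, 3, 0.95))   # 0.99275
# ...but worse than a single component when R is low (high failure rate)
print(r_out_of_n(2, 3, 0.40))   # 0.352
```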